Introduction to Open Data Science - Course Project


How are you feeling right now? What do you expect to learn? Where did you hear about the course?


Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?


My GitHub repository


## [1] "The date today is"
## [1] "2023-12-11 14:27:31 EET"

The text does not continue here


Assignment 2: Data wrangling and analysis

Anne Tyvijärvi

The learning2014 data is based on the international survey of Approaches to Learning. Observations with a value of zero for exam points have been removed from the data. “Deep”, “stra” and “surf” are combination variables (averages) of individual items measuring the same dimension. The data frame includes 166 rows and 7 columns, and all variables are numeric except gender, which is a character ("M"/"F").


The ggpairs() plot shows that the majority of participants identified as female, and most participants were 20–30 years old. There was a strong positive correlation between attitude and exam points (p < 0.001), and negative correlations between deep learning (questions measuring understanding of what is being studied) and attitude (p < 0.05), as well as between deep learning and learning strategy (making an effort to learn).

  • Based on the above results, I chose attitude, deep (relating to the “depth” of learning) and stra (learning strategies) as explanatory variables for exam points, and tested these with linear regression.

  • Model fit: the distribution of the residuals is not symmetric around 0, so the model fit is not necessarily good.

  • The t-value for attitude is 6.203, which indicates a possible relationship between attitude and exam points. This is supported by the p-value column (Pr(>|t|)), where p < 0.001.

  • The t-values for deep (-0.998) and stra (1.793) are small, so there is likely no strong relationship between points and deep/stra (although p < 0.1 for stra, so learning strategies might tend to be related to exam points).

  • R2 for the model is 0.2097, so 20.97% of the variance of the response variable (points) could be explained by the predictor variables (attitude, deep and stra)

  • Based on the above, the variables deep and stra were removed from the next regression analysis (See below for results).

  • Model fit: the distribution of the residuals is more symmetrical than when the other two variables were included in the model, indicating a better fit.

  • t-value for attitude is 6.124 and p < 0.001, indicating a strong relationship between exam points and attitude

  • R2 (multiple R squared) is 0.1906, so 19.06% of the variance in points was explained by attitude alone


Next, I tested whether the model meets the assumptions of linear regression (i.e., linearity, independence, homoscedasticity, normality, no multicollinearity and no endogeneity). The “Residuals vs. Fitted” plot can be used to detect non-linearity, unequal error variances and outliers. In our data, observations 145, 56 and 35 appear to be outliers, but otherwise the residuals are fairly evenly distributed around 0. The “Q-Q Residuals” plot checks for normality of the residuals; the same outliers are visible there. The “Residuals vs. Leverage” plot is mainly used to spot influential (high-leverage) observations; in our data the spread of the standardized residuals increases with leverage, which may also hint at heteroscedasticity. Based on these results I would either remove the outliers from the data or try transformations (log10, square root, ...) to better meet the assumptions of the linear regression model.

library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr   1.1.3     ✔ stringr 1.5.0
## ✔ forcats 1.0.0     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.0
## ✔ readr   2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

#Set the working directory and read the data into R

setwd("C:/Users/03114911/OneDrive - Valtion/Anne's PhD papers, results, plans etc/MBDP/Open data science/IODS-project")
learning2014_readback <- read_csv("Data/learning2014.csv", show_col_types = FALSE)

#Data structure and dimensions: these allow you to learn basic information of your data
str(learning2014_readback) # all data is numeric, except gender is "character"("M"/"F")
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(learning2014_readback) # the data has 166 rows and 7 columns
## [1] 166   7
#Plot the relationships between the variables in learning2014_readback to see how they are related. Note: alpha = 0.1 sets point transparency; the significance stars in the correlation panels come from ggpairs itself.
plot <- ggpairs(learning2014_readback, mapping = aes(col = gender, alpha = 0.1))
plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Multiple regression
model1 <- lm(points ~ attitude + deep + stra, data = learning2014_readback)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = learning2014_readback)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08
model2 <- lm(points ~ attitude, data = learning2014_readback)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014_readback)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# Testing whether the model meets the assumptions of a linear regression
par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))

date()
## [1] "Mon Dec 11 14:27:40 2023"
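As a possible next step suggested by the diagnostics above, one could drop the flagged outliers or transform the response and refit. A sketch only (not run for this report), using the objects defined above; the row indices are the outliers identified in the diagnostic plots:

```r
# Option 1: refit after removing the observations flagged as outliers
learning_no_out <- learning2014_readback[-c(35, 56, 145), ]
model2_no_out <- lm(points ~ attitude, data = learning_no_out)
summary(model2_no_out)

# Option 2: transform the response (e.g. square root) to stabilize variance
model2_sqrt <- lm(sqrt(points) ~ attitude, data = learning2014_readback)
par(mfrow = c(2, 2))
plot(model2_sqrt, which = c(1, 2, 5))  # re-check the diagnostic plots
```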

Here we go again…

Assignment 3

Anne Tyvijärvi

This data includes information on student alcohol consumption, personal identifiers (age, sex, etc.), information about the family (education of parents, family size, etc.) and variables measuring the student’s success in school.

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

I decided to study the relationships between low/high alcohol consumption and parental education (“Medu” for mother’s education and “Fedu” for father’s education), student’s current health status (“health”) and quality of family relationships (“famrel”).

I hypothesize that

  1. low level of education of either parent is associated with higher alcohol consumption in students
  2. the health status of a student is associated with higher alcohol consumption
  3. frequent alcohol use is associated with the quality of the relationship with the family

The mean age of the students was 16.6 years. The mean score for alcohol consumption (on a scale of 1 to 5, where 1 = very low consumption and 5 = very high consumption) was 1.9, indicating a fairly low level of alcohol consumption across the data set. The mean scores for the education levels of the students' mothers and fathers were 2.8 and 2.6, indicating a slightly higher level of education for the mothers. Overall, the students' health levels were good (mean score 3.6), as was the quality of the family relationships (mean score 3.9). The count summaries support these findings.

## [1] 16.6
## [1] 1.9
## [1] 2.8
## [1] 2.6
## [1] 3.6
## [1] 3.9
## # A tibble: 7 × 2
##     age count
##   <dbl> <int>
## 1    15    81
## 2    16   102
## 3    17    97
## 4    18    77
## 5    19    11
## 6    20     1
## 7    22     1
## # A tibble: 9 × 2
##   alc_use count
##     <dbl> <int>
## 1     1     140
## 2     1.5    63
## 3     2      56
## 4     2.5    41
## 5     3      32
## 6     3.5    17
## 7     4       9
## 8     4.5     3
## 9     5       9
## # A tibble: 5 × 2
##    Medu count
##   <dbl> <int>
## 1     0     3
## 2     1    49
## 3     2    96
## 4     3    93
## 5     4   129
## # A tibble: 5 × 2
##    Fedu count
##   <dbl> <int>
## 1     0     2
## 2     1    73
## 3     2   105
## 4     3    97
## 5     4    93
## # A tibble: 5 × 2
##   health count
##    <dbl> <int>
## 1      1    46
## 2      2    42
## 3      3    80
## 4      4    62
## 5      5   140
## # A tibble: 5 × 2
##   famrel count
##    <dbl> <int>
## 1      1     8
## 2      2    18
## 3      3    64
## 4      4   180
## 5      5   100
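The numbers above were computed along these lines (a sketch; `alc` is the wrangled data set from the data-wrangling exercise):

```r
library(dplyr)

# Means of the chosen variables, rounded to one decimal
round(mean(alc$age), 1)
round(mean(alc$alc_use), 1)
round(mean(alc$Medu), 1)
round(mean(alc$Fedu), 1)
round(mean(alc$health), 1)
round(mean(alc$famrel), 1)

# Count tables, e.g. students per age and per alcohol-use score
alc %>% count(age)
alc %>% count(alc_use)
```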

Below, the charts show that in general, alcohol use seemed to be more frequent where either parent had a higher education. This is opposite to my first hypothesis.

In addition, in contrast to my second and third hypotheses, neither a poor health score nor poor family relationships seemed to be associated with more frequent alcohol use.

Now, let’s see if there are statistically significant differences associated with the variables that I’ve chosen and high alcohol use. Let’s set the “family = binomial” to let R know that the dependent variable high_use is binary (TRUE/FALSE).
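In code, the model matches the Call line of the output below (a sketch; `alc` as before):

```r
# Logistic regression: the binary outcome high_use (TRUE/FALSE)
# modeled with the four chosen predictors
glm1 <- glm(high_use ~ Medu + Fedu + health + famrel,
            family = "binomial", data = alc)
summary(glm1)
coef(glm1)
```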

Summary of the model shows a significant (p < 0.05) relationship between the quality of family relationships and high alcohol use. In addition, there is a trend of the health score being associated with high alcohol use (p < 0.1).

## 
## Call:
## glm(formula = high_use ~ Medu + Fedu + health + famrel, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.18242    0.63354  -0.288   0.7734  
## Medu        -0.01537    0.13745  -0.112   0.9110  
## Fedu         0.02092    0.13709   0.153   0.8787  
## health       0.14777    0.08482   1.742   0.0815 .
## famrel      -0.31087    0.12454  -2.496   0.0126 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 443.49  on 365  degrees of freedom
## AIC: 453.49
## 
## Number of Fisher Scoring iterations: 4
## (Intercept)        Medu        Fedu      health      famrel 
## -0.18242217 -0.01536970  0.02092328  0.14777237 -0.31086747

Since the education of either parent does not explain high alcohol use, let's remove those variables from the model and see how that looks. Let's also look at the coefficients/odds ratios and provide confidence intervals for them.

The summary shows a slight improvement in the p-value for “health”. From the coefficient output, the odds ratio for “health” is ~1.16, meaning that a one-unit increase in the health score is associated with ~16% higher odds of high alcohol use. For family relationships (famrel), the odds ratio is ~0.73, meaning that better family relationships are associated with lower odds of high use. Note that the 2.5%/97.5% columns in the output are still on the log-odds scale, which is why they contain negative values: for health the interval includes 0 (so the odds-ratio interval includes 1, consistent with the marginal p < 0.1), while for famrel it excludes 0 (consistent with p < 0.05).

## 
## Call:
## glm(formula = high_use ~ health + famrel, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.17461    0.54626  -0.320   0.7492  
## health       0.14850    0.08465   1.754   0.0794 .
## famrel      -0.31083    0.12454  -2.496   0.0126 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 443.51  on 367  degrees of freedom
## AIC: 449.51
## 
## Number of Fisher Scoring iterations: 4
##             odd_ratios       2.5 %      97.5 %
## (Intercept)  0.8397865 -1.25837614  0.89223235
## health       1.1600907 -0.01507081  0.31752786
## famrel       0.7328380 -0.55725045 -0.06718336
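To read the intervals directly on the odds-ratio scale, one could exponentiate them (a sketch; `glm2` is the two-predictor model fitted above):

```r
# Odds ratios with confidence intervals on the odds scale
OR <- exp(coef(glm2))
CI <- exp(confint(glm2))  # profile-likelihood intervals, exponentiated
cbind(OR, CI)
# famrel's interval should lie entirely below 1 (consistent with p < 0.05),
# while health's should include 1 (consistent with the marginal p < 0.1)
```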

Next, let's explore the predictive power of our model glm2. The prediction got 256 of the 259 FALSE observations right, so for FALSE cases it works quite well. However, it failed to identify any of the TRUE observations (0/111), which sounds bad. Since I'm not a data scientist and this is my first time performing such an analysis, I'm not entirely sure how to interpret this, but a likely reason is that the classes are imbalanced (259 FALSE vs. 111 TRUE) and the model's predicted probabilities rarely exceed the 0.5 cut-off, so it essentially always predicts FALSE (the clearly non-normal distributions of health and famrel may also play a role). The benchmark losses put this in context: always guessing FALSE (probability 0) gives an average proportion of incorrect predictions of 0.3 (the share of TRUE cases), and always guessing TRUE (probability 1) gives 0.7. As the cross-tabulation shows, our model behaves almost exactly like the always-FALSE guesser.

##         prediction
## high_use FALSE TRUE
##    FALSE   256    3
##    TRUE    111    0
## [1] 0.3
## [1] 0.7
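The overall accuracy implied by the cross-tabulation above can be computed directly from the printed counts (a self-contained sketch):

```r
# Rebuild the confusion matrix from the printed output
conf <- matrix(c(256,   3,
                 111,   0),
               nrow = 2, byrow = TRUE,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

# Overall accuracy: correct predictions on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)  # (256 + 0) / 370
accuracy  # ~0.692, i.e. slightly worse than always guessing FALSE (259/370 = 0.7)
```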

Regarding the three hypotheses set in the beginning, student health and the quality of family relationships appeared to be associated with alcohol use, so my last two hypotheses were supported. However, I found no support for the first hypothesis: the education level of either parent does not seem to be associated with alcohol consumption in students. Based on the odds ratios and predictions, my glm model did not work very well, but the time allocated for these exercises did not allow me to explore further what was “wrong” with it.

As final words I would like to declare that I put a lot of time and effort in these exercises, and although I might have not interpreted the results “correctly” and made a few mistakes, I really tried my best and gave it a 100 % effort. :) I hope you (whoever grades this) appreciate the effort I made (being new to all these things shows…) :)

Assignment 4

Anne Tyvijärvi


The Boston data frame contains information related to the housing values in the suburbs of Boston. The data consists of 506 rows and 14 columns.

## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

From the data summary we can see that, for example, the average per-capita crime rate (“crim”) is 3.6, and that the median value of owner-occupied homes (“medv”) is around $22,500. As the summary and the boxplots also show, the majority of owner-occupied units were built before 1940 (“age”).

## Warning: package 'corrplot' was built under R version 4.3.2
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The correlation plot (alpha = 0.001) shows significant negative correlations between the distance to employment centres (“dis”) and the proportion of non-retail business (“indus”), nitrogen oxide concentrations (“nox”), and the proportion of units built before 1940 (“age”). There is a strong positive correlation (p < 0.001) between the property tax rate (“tax”, full-value property-tax rate per $10,000) and accessibility to radial highways (“rad”).

Let's scale the data and print a summary of the scaled data: all means are now 0.

Next, let's create a categorical variable from the scaled crime rate in the Boston data set, using quantiles as break points, and drop the old crime rate from the data set.
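A sketch of how these two steps are typically done (the exact code is not shown in this report, and the object name `boston_scaled` is my assumption):

```r
library(MASS)  # for the Boston data

# Standardize all variables (mean 0, sd 1)
boston_scaled <- as.data.frame(scale(Boston))

# Categorical crime variable using quantiles as break points
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Drop the old crime rate and add the categorical version
boston_scaled <- boston_scaled[, names(boston_scaled) != "crim"]
boston_scaled$crime <- crime
```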

##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
## [1] FALSE
## [1] "numeric"
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"

Then, let's divide the data into “train” and “test” sets, putting 80 % of the data in “train” and the rest in “test”. Next, let's fit an LDA model with crime rate as the target variable and all other variables as predictors, and draw the LDA biplot. Finally, let's save the correct classes from the “crime” variable and predict the classes of the test set with the LDA model.

From the cross-tabulation we can see that the prediction was correct for 17 of 31 observations in low, 13 of 18 in med_low, 18 of 28 in med_high and 25 of 25 in high. The model works best for the highest class (high, 100% accuracy), whereas the other classes were predicted less accurately (~55-72%).

## integer(0)
##           predicted
## correct    low med_low med_high high
##   low       17      12        2    0
##   med_low    0      13        5    0
##   med_high   0       7       18    3
##   high       0       0        0   25
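The split-fit-predict steps can be sketched as follows (object names `boston_scaled`, `train` and `test` are assumptions; the split is random, so the exact counts will vary between runs):

```r
library(MASS)

# 80/20 train/test split
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

# Save the correct classes, then remove them from the test set
correct_classes <- test$crime
test$crime <- NULL

# LDA with crime as target and all other variables as predictors
lda.fit <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)

# Cross-tabulate correct vs. predicted classes
table(correct = correct_classes, predicted = lda.pred$class)
```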

Now, let's reload the Boston dataset and standardize it. Let's calculate the distances between the observations and print a summary of them. Then let's run the K-means algorithm, determine the optimal number of clusters, and re-run the algorithm with that number of clusters. (First run with 3 clusters, then with 2 according to the optimization.) Let's plot the clusters using pairs().

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
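The distance and K-means steps can be sketched like this (self-contained, using the Boston data from MASS; the seed is my addition, since K-means is stochastic):

```r
library(MASS)
library(ggplot2)

boston_std <- scale(Boston)

# Euclidean distances between the observations
dist_eu <- dist(boston_std)
summary(dist_eu)

# Total within-cluster sum of squares for 1..10 clusters
set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_std, centers = k)$tot.withinss)
qplot(1:10, twcss, geom = "line")  # look for the "elbow"

# Re-run with the chosen number of clusters (2, per the elbow)
km <- kmeans(boston_std, centers = 2)
pairs(boston_std[, 1:5], col = km$cluster)
```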

Let’s try the super-bonus..

… OK, something weird is happening here with the second plot but I can’t figure out what it is. Let’s just forget about the super-bonus. :’(

## [1] 404  13
## [1] 13  3
## Warning: package 'plotly' was built under R version 4.3.2

Anne Tyvijärvi

Assignment 5


The data contains the following variables:

GNI = Gross National Income per capita
Life.Exp = Life expectancy at birth
Edu.Exp = Expected years of schooling
Mat.Mor = Maternal mortality ratio
Ado.Birth = Adolescent birth rate
Parli.F = Percentage of female representatives in parliament
Edu2.FM = Edu2.F / Edu2.M (Edu2.F = proportion of females with at least secondary education, Edu2.M = proportion of males with at least secondary education)
Labo.FM = Labo.F / Labo.M (Labo.F = proportion of females in the labour force, Labo.M = proportion of males in the labour force)
Country = country :)

From the summary we can see, for example, that the mean life expectancy at birth across the whole data set is ~72 years, and the expected years of schooling is ~13 years. Also, the scale of the variables varies from 10^-2 to 10^5, which might get us in trouble later :D.

Looking at the graphical summary of the data, only Edu.Exp looks like it is normally distributed. There are multiple significant correlations between the variables. To name a few, the maternal mortality ratio has a strong negative correlation with the life expectancy at birth and a strong positive correlation with adolescent birth rate.

## Rows: 155
## Columns: 9
## $ Country   <chr> "Norway", "Australia", "Switzerland", "Denmark", "Netherland…
## $ Edu2.FM   <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.9690608, 0.992…
## $ Labo.FM   <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.8286119, 0.807…
## $ Edu.Exp   <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9, 19.2, …
## $ Life.Exp  <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0, 81.8, …
## $ GNI       <dbl> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 4215…
## $ Mat.Mor   <dbl> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 11, 6, 6,…
## $ Ado.Birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3, 6.0, 6…
## $ Parli.F   <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2, 31.4, …
##    Ado.Birth           GNI            Edu.Exp         Life.Exp    
##  Min.   :  0.60   Min.   :   581   Min.   : 5.40   Min.   :49.00  
##  1st Qu.: 12.65   1st Qu.:  4198   1st Qu.:11.25   1st Qu.:66.30  
##  Median : 33.60   Median : 12040   Median :13.50   Median :74.20  
##  Mean   : 47.16   Mean   : 17628   Mean   :13.18   Mean   :71.65  
##  3rd Qu.: 71.95   3rd Qu.: 24512   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :204.80   Max.   :123124   Max.   :20.20   Max.   :83.50  
##     Mat.Mor          Parli.F         Edu2.FM          Labo.FM      
##  Min.   :   1.0   Min.   : 0.00   Min.   :0.1717   Min.   :0.1857  
##  1st Qu.:  11.5   1st Qu.:12.40   1st Qu.:0.7264   1st Qu.:0.5984  
##  Median :  49.0   Median :19.30   Median :0.9375   Median :0.7535  
##  Mean   : 149.1   Mean   :20.91   Mean   :0.8529   Mean   :0.7074  
##  3rd Qu.: 190.0   3rd Qu.:27.95   3rd Qu.:0.9968   3rd Qu.:0.8535  
##  Max.   :1100.0   Max.   :57.50   Max.   :1.4967   Max.   :1.0380

Let's first run PCA with unstandardized variables. Here PC1 explains ~100% of the total variance and is aligned with GNI, which is a result of not scaling the variables before running the PCA.

## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Now, let’s try again with standardized variables.

We can see that together, PC1 and PC2 accounted for 69.8% of the total variance in the variables. The maternal mortality ratio is aligned with the adolescent birth rate (also visible in the correlations), and both are positively correlated with PC1. Expected years of schooling, life expectancy at birth, the proportion of females with at least secondary education and GNI are aligned together and negatively correlated with PC1. The proportion of females in the labour force and the percentage of female representatives in parliament are aligned together and positively correlated with PC2. The country names are hard to read, but in general many West African countries (Burkina Faso, Gambia, Benin, Sierra Leone) align with maternal mortality ratio and adolescent birth rate, whereas East and southern African countries (e.g., Namibia, South Africa, Rwanda, Tanzania) align with the proportion of females in the labour force and the percentage of female representatives in parliament. Many European countries show no apparent connection to any single measured variable.

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
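The two PCA runs can be sketched like this (the object name `human_` for the wrangled human data, with Country as row names, is my assumption):

```r
# PCA on unstandardized data (dominated by GNI's large scale)
pca_raw <- prcomp(human_)
summary(pca_raw)
biplot(pca_raw, choices = 1:2)

# Standardize first, then PCA
human_std <- scale(human_)
pca_std <- prcomp(human_std)
summary(pca_std)

# Biplot: observations plus variable arrows on PC1/PC2
biplot(pca_std, choices = 1:2, cex = c(0.6, 0.9),
       col = c("grey40", "deeppink2"))
```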

OK, let's then load the tea data set (converting character variables to factors) and look at the structure and dimensions of the data. There are 36 variables and 300 observations.

## Rows: 300
## Columns: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, b…
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N…
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evening, eveni…
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch…
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, d…
## $ always           <fct> Not.always, Not.always, Not.always, Not.always, alway…
## $ home             <fct> home, home, home, home, home, home, home, home, home,…
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work, Not.wor…
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, N…
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friends, Not.f…
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, No…
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,…
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G…
## $ How              <fct> alone, milk, alone, alone, alone, alone, alone, milk,…
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,…
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,…
## $ where            <fct> chain store, chain store, chain store, chain store, c…
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable, p_vari…
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 6…
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F,…
## $ SPC              <fct> middle, middle, other worker, student, employee, stud…
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsman, sport…
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-4…
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/we…
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-ex…
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spirituality,…
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healthy, heal…
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diure…
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendliness, Not…
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not.iron ab…
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminin…
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sophisticat…
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.slimming, N…
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciting, No.e…
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxin…
## $ effect.on.health <fct> No.effect on health, No.effect on health, No.effect o…
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300  36

Next, let’s run a multiple correspondence analysis (MCA) on the tea data set. :D

Because there are so many variables, I’ll use only the first 10 of them to keep the MCA plot readable, and plot only the variable categories (not the individuals).

I’m not sure this type of data is really the best fit for MCA: MCA is also a dimension reduction method, and this data is quite “simple” (most variables have only two levels). Based on this plot at least, I can’t see any clear patterns or associations. I would probably save MCA for something more complicated, where it could reveal underlying patterns in the data.
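The MCA call itself is echoed in the summary below (`MCA(X = tea[1:10], ncp = 2, graph = FALSE)`); the plotting line is my reconstruction of the variables-only plot, and `data(tea)` stands in for however the data was actually loaded:

```r
library(FactoMineR)

data(tea)  # stand-in for the tea data loaded earlier
mca <- MCA(tea[, 1:10], ncp = 2, graph = FALSE)
summary(mca)
# plot only the variable categories, hiding the individuals
plot(mca, invisible = "ind")
```

Ten binary variables give 20 categories, so the MCA has 20 − 10 = 10 dimensions, matching the eigenvalue table below.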

## 
## Call:
## MCA(X = tea[1:10], ncp = 2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.179   0.129   0.115   0.107   0.104   0.090   0.079
## % of var.             17.939  12.871  11.481  10.654  10.406   8.985   7.917
## Cumulative % of var.  17.939  30.810  42.291  52.945  63.351  72.337  80.253
##                        Dim.8   Dim.9  Dim.10
## Variance               0.076   0.068   0.054
## % of var.              7.559   6.793   5.395
## Cumulative % of var.  87.812  94.605 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1             | -0.543  0.549  0.474 | -0.462  0.552  0.342 |
## 2             | -0.543  0.549  0.474 | -0.462  0.552  0.342 |
## 3             | -0.068  0.009  0.002 |  0.542  0.761  0.141 |
## 4             | -1.037  1.998  0.558 |  0.264  0.180  0.036 |
## 5             | -0.235  0.103  0.061 | -0.011  0.000  0.000 |
## 6             | -1.037  1.998  0.558 |  0.264  0.180  0.036 |
## 7             |  0.024  0.001  0.001 | -0.404  0.422  0.374 |
## 8             | -0.193  0.069  0.054 |  0.058  0.009  0.005 |
## 9             | -0.258  0.124  0.117 | -0.624  1.007  0.680 |
## 10            | -0.048  0.004  0.002 | -0.153  0.061  0.020 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |   0.209   1.168   0.040   3.471 |  -0.770  22.090   0.547
## Not.breakfast |  -0.193   1.078   0.040  -3.471 |   0.710  20.391   0.547
## Not.tea time  |  -0.680  11.253   0.358 -10.351 |   0.327   3.635   0.083
## tea time      |   0.527   8.723   0.358  10.351 |  -0.254   2.817   0.083
## evening       |   0.445   3.787   0.103   5.562 |   0.634  10.728   0.210
## Not.evening   |  -0.233   1.980   0.103  -5.562 |  -0.332   5.609   0.210
## lunch         |   0.947   7.329   0.154   6.788 |   0.167   0.317   0.005
## Not.lunch     |  -0.163   1.260   0.154  -6.788 |  -0.029   0.054   0.005
## dinner        |  -1.571   9.633   0.186  -7.454 |   1.044   5.925   0.082
## Not.dinner    |   0.118   0.725   0.186   7.454 |  -0.079   0.446   0.082
##                v.test  
## breakfast     -12.786 |
## Not.breakfast  12.786 |
## Not.tea time    4.983 |
## tea time       -4.983 |
## evening         7.929 |
## Not.evening    -7.929 |
## lunch           1.195 |
## Not.lunch      -1.195 |
## dinner          4.951 |
## Not.dinner     -4.951 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2  
## breakfast     | 0.040 0.547 |
## tea.time      | 0.358 0.083 |
## evening       | 0.103 0.210 |
## lunch         | 0.154 0.005 |
## dinner        | 0.186 0.082 |
## always        | 0.089 0.096 |
## home          | 0.008 0.114 |
## work          | 0.215 0.006 |
## tearoom       | 0.315 0.003 |
## friends       | 0.325 0.141 |

Assignment 6

Anne Tyvijärvi

Part I

Implement the analyses of Chapter 8 of MABS using the RATS data

“The rats data contains information collected from a nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded (approximately) weekly (except in week seven, when two recordings were taken) over a 9-week period.”

I’ll start by reading in both data sets needed in this final data analysis.

The rats data contains information on rat weights over a period of time. Let’s convert Group and ID into factors.

library(tidyverse)
library(lme4)
## Warning: package 'lme4' was built under R version 4.3.2
## Warning: package 'Matrix' was built under R version 4.3.2
library(ggplot2)

rats_new <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE)
BPRS_new <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", header = TRUE)

summary(rats_new)
##        ID            Group           WD1             WD8             WD15      
##  Min.   : 1.00   Min.   :1.00   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  1st Qu.: 4.75   1st Qu.:1.00   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  Median : 8.50   Median :1.50   Median :340.0   Median :345.0   Median :347.5  
##  Mean   : 8.50   Mean   :1.75   Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  Max.   :16.00   Max.   :3.00   Max.   :555.0   Max.   :560.0   Max.   :565.0  
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0
str(rats_new)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
rats_new$Group <- factor(rats_new$Group)
rats_new$ID <- factor(rats_new$ID)
colnames(rats_new)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36"  "WD43" 
## [10] "WD44"  "WD50"  "WD57"  "WD64"

Let’s transform the data into long form, extract the measurement day from the WD column, and store it as a new integer column.

Finally, let’s draw a boxplot of the data. From it we can see that the three groups appear to differ.

# Transform the data to long form
ratsL <-  pivot_longer(rats_new, cols = -c(Group, ID),
                       names_to = "WD", values_to = "weight") %>%
          arrange(WD)

#Create a new column of the extracted times value
ratsL <-  ratsL %>% 
            mutate(time = as.integer(substr(WD,3,4)))


#plot and group data by Groups
ggplot(ratsL, aes(x = time, y = weight, group = Group)) +
 geom_boxplot() +
 scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")

Let’s try standardizing the weights within each time point:

I’ll leave the code commented out, as standardization does not seem useful for this kind of data.

#standardize
#ratsL <- ratsL %>%
 # group_by(time) %>%
  #mutate(stdweight = (weight - mean(weight))/sd(weight)) %>%
  #ungroup()

# Plot again
#ggplot(ratsL, aes(x = time, y = stdweight, group = Group)) +
 # geom_boxplot()

Next, let’s plot the group mean profiles with the standard error of the mean (SEM). The SEM looks reasonable for all groups: Group 1 has the smallest SEM and Group 2 perhaps the largest. In any case, nothing alarming here.

# create columns for the group-wise mean and SEM
# (n() counts the rats within each group, which differs between groups,
# instead of a hard-coded total n = 16)
ratsLS <- ratsL %>%
  group_by(Group, time) %>%
  summarise(mean = mean(weight), se = sd(weight)/sqrt(n())) %>%
  ungroup()

# ..and plot the mean profiles
ggplot(ratsLS, aes(x = time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=5) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
  theme(legend.position = "top") +
  scale_y_continuous(name = "mean +/- sem")

Next, let’s draw a boxplot of the per-rat mean weights, showing the group means and possible outliers.

The boxplot shows two possible outliers: one in Group 2 and one in Group 3 (first boxplot).

Using the method from the exercises, I tried removing the outliers from the data (second boxplot), but that only created “new” outliers, so I’m not sure what the best way to handle this would be. I would probably run a normality test (e.g., Shapiro–Wilk; not going to go there now), see what that looks like, and then try different data transformations.

ratsL8S <- ratsL %>%
  filter(time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(weight) ) %>%
  ungroup()

# Draw a boxplot of the mean versus treatment by Group
ggplot(ratsL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=21, size=5, fill = "white") +
  scale_y_continuous(name = "mean(weight)")

#Let's try to filter some outliers (not sure what would be the best way to get rid of the third one)
ratsL8S1 <- ratsL8S %>%
  filter(mean < 550)
#ratsL8S2 <- ratsL8S1 %>%
  #filter(mean > 250)
#.. and plot again
ggplot(ratsL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=21, size=5, fill = "white") +
  scale_y_continuous(name = "mean(weight)")
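The Shapiro–Wilk check I mentioned could be sketched like this. The vector below is made up for illustration (it is not the actual data), and with only a handful of rats per group the test has little power:

```r
# Sketch: Shapiro-Wilk normality test on one group's per-rat mean weights.
# The numbers below are illustrative stand-ins, not values from the rats data.
weights <- c(263, 245, 261, 267, 270, 275, 271, 265)
shapiro.test(weights)  # a large p-value would be consistent with normality
```

In practice this would be run per group, e.g. with `by(ratsL8S$mean, ratsL8S$Group, shapiro.test)`.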

I will not run a t-test here because it only compares two groups (and we have three). Instead, let’s fit a basic linear model and run an ANOVA. We can see that the group means differ significantly (p < 0.001).

lm1 <- lm(mean ~ Group, data = ratsL8S1)
summary(lm1)
## 
## Call:
## lm(formula = mean ~ Group, data = ratsL8S1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.886  -3.080   3.273   9.250  14.386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  263.716      5.165   51.06 2.09e-15 ***
## Group2       185.739      9.890   18.78 2.90e-10 ***
## Group3       262.080      8.945   29.30 1.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.61 on 12 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9857 
## F-statistic: 483.6 on 2 and 12 DF,  p-value: 3.387e-12
anova(lm1)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 206390  103195   483.6 3.387e-12 ***
## Residuals 12   2561     213                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part II

Implement the analyses of Chapter 9 of MABS using the BPRS data.

“The BPRS data contains information of 40 male subjects who were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.”

The BPRS data is now loaded. Let’s start again by converting treatment and subject into factors and transforming the data to long form.

# BPRS data is available

#let's transform these as factors
BPRS_new$treatment <- factor(BPRS_new$treatment)
BPRS_new$subject <- factor(BPRS_new$subject)

# and glimpse the data
glimpse(BPRS_new)
## Rows: 40
## Columns: 11
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, …
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, …
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, …
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, …
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, …
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, …
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, …
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, …
# Convert data to long form
BPRSL <-  pivot_longer(BPRS_new, cols = -c(treatment, subject),
                       names_to = "weeks", values_to = "bprs") %>%
          arrange(weeks) 
BPRSL <-  BPRSL %>% 
            mutate(week = as.integer(substr(weeks,5,5)))
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
dim(BPRSL)
## [1] 360   5

Let’s also visualize the data in the two different groups (/treatments).

#plot the data
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Next, let’s account for the fact that the repeated measurements from the same subject are not independent. As a (naive) baseline, I first fit an ordinary linear regression of bprs on week and treatment; its summary shows that bprs decreased significantly over time, while treatment had no significant effect.

Then let’s fit a random intercept model with subject as the random effect, and a second model that additionally allows the treatment effect to vary between subjects. Comparing the two with anova() shows that the latter model fits clearly better (p < 0.001).

library(lme4)
names(BPRSL)
## [1] "treatment" "subject"   "weeks"     "bprs"      "week"
# let's create a regression model
lm2 <- lm(bprs ~ week + treatment, data = BPRSL)
summary(lm2)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
BPRS_lmer <- lmer(bprs ~ treatment + weeks + (1 | subject), data = BPRSL, REML = FALSE)
BPRS_lmer1 <- lmer(bprs ~ treatment + weeks + (treatment | subject), data = BPRSL, REML = FALSE)
anova(BPRS_lmer, BPRS_lmer1)
## Data: BPRSL
## Models:
## BPRS_lmer: bprs ~ treatment + weeks + (1 | subject)
## BPRS_lmer1: bprs ~ treatment + weeks + (treatment | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRS_lmer    12 2751.3 2798.0 -1363.7   2727.3                         
## BPRS_lmer1   14 2576.5 2630.9 -1274.2   2548.5 178.86  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, let’s add the treatment * weeks interaction to the random-effects model and compare it with the model without the interaction. From the ANOVA table we can see that the model with the interaction fits significantly better (p < 0.05).

BPRS_lmer2 <- lmer(bprs ~ treatment + weeks + (treatment | subject) + treatment * weeks, data = BPRSL, REML = FALSE)
anova(BPRS_lmer1, BPRS_lmer2)
## Data: BPRSL
## Models:
## BPRS_lmer1: bprs ~ treatment + weeks + (treatment | subject)
## BPRS_lmer2: bprs ~ treatment + weeks + (treatment | subject) + treatment * weeks
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_lmer1   14 2576.5 2630.9 -1274.2   2548.5                       
## BPRS_lmer2   22 2576.5 2662.0 -1266.3   2532.5 15.938  8    0.04328 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, let’s draw a plot with the actual values and fitted values of our last model.

From the plots we can see that the observed and fitted values are quite similar, indicating a good fit of our model.

# draw the plot with the observed values of bprs
plot1 <- ggplot(BPRSL, aes(x = week, y = bprs, group = treatment)) +
  geom_boxplot(aes(linetype = treatment)) +
facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name="Observed bprs", limits=c(0, 100))

# Create a vector of the fitted values of bprs and add it to BPRSL data table
fitted_vector <- fitted(BPRS_lmer2)
BPRSL$Fitted <- fitted_vector

# draw the plot with the Fitted values of bprs
plot2 <- ggplot(BPRSL, aes(x = week, y = Fitted, group = treatment)) +
  geom_boxplot(aes(linetype = treatment)) +
facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name="Fitted bprs", limits=c(0, 100))

library(gridExtra)

grid.arrange(plot1, plot2)